Toward the Automatic Generation of Cued Speech
Abstract
Although Manual Cued Speech (MCS) can greatly facilitate both education and communication for the deaf, its use is limited to situations in which the talker, or a transliterator, is able to produce cues for the cue receiver. The availability of automatically produced cues would substantially relax this restriction. However, it is unclear whether current automatic speech recognition (ASR) technology would be adequate for producing cues automatically. To evaluate this adequacy, we measured the speech reception scores achieved by highly experienced receivers of MCS when cues were produced by one actual and several hypothesized ASR systems, as well as with speechreading alone and with MCS. The systems studied modelled the effects of various types of errors and delays associated with the recognition process, and included a representative state-of-the-art, speaker-dependent phonetic recognizer. Results indicate that while current speaker-independent recognition technology probably would not provide useful cues, the cues provided by a speaker-dependent ASR system would aid speech reception substantially. The benefit provided by automatically generated cues is heavily dependent on the use of an effective visual display that minimizes the effects of recognition delays so that cues are perceived in synchrony with facial actions.

Introduction

Manual Cued Speech (MCS; Cornett, 1967) is a visual display of a discrete symbolic code used in conjunction with speechreading in everyday environments. The system can be taught to very young deaf children and greatly facilitates both language learning and general education. Evaluations of deaf children who use MCS by Kaplan (1974), Ling and Clarke (1975), Clarke and Ling (1976), and Nicholls and Ling (1982) have documented the benefits provided by the system. The results reported in the latter study were particularly dramatic. Eighteen senior-class children who had used MCS for an average of 7.3 years were tested for reception of words in both Low-Predictability (LP) and High-Predictability (HP) sentences similar to those used in the SPIN test (Kalikow, Stevens, & Elliott, 1977). Keyword scores improved from 25.5% (LP) and 32.0% (HP) correct for speechreading alone to 96.6% (LP) and 96.2% (HP) correct when speechreading was supplemented by cues. In a more recent study, Uchanski et al. (1994) tested four highly experienced young-adult users of MCS. The subjects were 18-27 years of age and had acquired hearing losses before the age of two years. Test materials were videotaped Harvard sentences (IEEE, 1969), which provide few contextual cues and consequently pose considerably greater difficulty than everyday conversational speech. Using speechreading alone, the subjects were able to recognize only 25% of the keywords in the sentences. With MCS, however, keyword scores increased to 84%. Although MCS provides enormous benefit in the education of the deaf, the assistance it provides in day-to-day communication is limited to situations in which the talker, or a transliterator, produces the cues. To overcome this limitation, Cornett and his colleagues attempted to develop an "Autocuer," a system that would derive cues similar to those of MCS automatically, by electronic analysis of the acoustic speech signal, and display them visually to the cue receiver. The following sections review the systems that resulted from this development effort, as an introduction to the research reported in this paper.

Automatically Generated Cues

In the studies to be considered, cues are derived by automatic speech recognition (ASR) technology. Therefore, the usefulness of the cues, evaluated in terms of the improvement in speechreading scores, is strongly dependent on the performance of the ASR system. In some cases this performance is reported in terms of the percentage of phonetic units (phones) in the acoustic waveform that are correctly identified by the recognizer. In computing this measure, phones identified as other phones (substitutions) and phones whose occurrence is not detected (deletions) are counted as errors. However, the phonetic transcription produced by a recognizer may also include phones not present in the speech waveform (insertions). Unlike the "correct identification" score, which ignores insertions, the "accuracy" score subtracts the number of insertions from the number of correct identifications. Since this terminology has become standard only relatively recently, the interpretation of some older reports of recognizer performance is problematic. Of course, strict comparisons of performance between different systems are also very difficult to make because of differences in the databases, training regimens, and speech materials employed.
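To make the two scoring conventions concrete, the short sketch below computes both measures from hypothetical counts; the counts and variable names are illustrative and are not taken from any system discussed in this paper.

    # Hypothetical counts for a recognizer run against N reference phones.
    N = 1000                  # phones in the reference transcription
    substitutions = 150
    deletions = 100
    insertions = 50

    correct = N - substitutions - deletions           # correctly identified phones
    percent_correct = 100.0 * correct / N             # "correct identification": ignores insertions
    accuracy = 100.0 * (correct - insertions) / N     # "accuracy": penalizes insertions

    print(percent_correct)    # 75.0
    print(accuracy)           # 70.0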
The Gallaudet-R.T.I. Autocuer

Under Cornett's leadership, Gallaudet University and the Research Triangle Institute undertook the development of automatic cueing systems. The best-described implementation of the Gallaudet-R.T.I. Autocuer (Cornett, Beadles, & Wilson, 1977) displayed the cues as virtual images of a pair of seven-segment LED elements projected in the viewing field of an eyeglass lens. By activating the segments selectively, nine distinct symbol shapes, cueing consonant distinctions, were created at each of four positions, cueing vowel distinctions. The cue groups selected differed from those used in MCS, a concession to the difficulties of speech recognition, but, as in MCS, sounds that are difficult to distinguish through speechreading occurred in different cue groups. A block diagram of the speech analysis subsystem proposed for the wearable version of the Autocuer suggests that speech sounds would be assigned to cue groups on the basis of estimates of the voice pitch as well as measurements of zero-crossing rates and peak-to-peak amplitudes in low- and high-frequency bands of speech. To distinguish cue display limitations from speech analyzer limitations, Cornett et al. (1977) conducted speech reception tests both with "ideal" cues derived manually from spectrograms and with cues derived automatically by the analyzer. Subjects included normal-hearing listeners and listeners with severe to profound hearing losses. Since none of these subjects had experience with Manual Cued Speech, they received roughly 40 hours of training with the system before participating in speech reception tests. On tests of the reception of common words spoken in isolation, scores improved from 63% for speechreading alone to 84% when spectrogram-derived cues were presented. By comparison, when analyzer-derived cues were presented, scores were essentially the same as in the unaided condition. Tests of the reception of common words in short phrases, however, indicated that neither spectrogram- nor analyzer-derived cues improved scores significantly. The authors interpreted the results with isolated words as highly encouraging.
They attributed the poor performance seen for words in phrases to an inability of the subjects to integrate the cues with speechreading. The authors concluded that the subjects might require as much as eight months of exposure to the cues to decode cued sentences spoken at normal speaking rates, as had been suggested by the findings of Clarke and Ling (1976). Cornett et al. (1977) regarded the results of their study as supporting the further development of a wearable prototype of the Autocuer. Assuming that the error rate of the speech analyzer component could be improved sufficiently, the availability of such a system would permit subjects to train on automatically produced cues. They speculated that until sufficient consistent exposure to such cues could be provided to cue receivers, it would be impossible to evaluate the performance of the system realistically. Although a wearable prototype of the Autocuer has been produced, it is not in use at the present time, and development of this system does not appear to have progressed as planned. Evaluations of the most recent implementation of this Autocuer (Beadles, 1989) indicate that the phoneme identification score is roughly 54%, with a 33% deletion rate and a 13% substitution rate. Simulation studies of the reception of common words spoken in isolation (Beadles, 1989) indicate that this level of recognition performance would lead to scores of roughly 67% correct, only eight percentage points higher than for speechreading alone on the task. Thus the Autocuer's low phoneme identification score, particularly its high deletion rate, probably would not improve communication greatly relative to speechreading alone. The Cornett et al. (1977) study is discussed further on page 26.

Potential Recognition Systems

Although there have been many recent attempts to develop auditory (e.g., Grant, Braida, & Renn, 1991), tactile (Reed & Durlach, 1995; Boothroyd, Kishon-Rabin, & Waldstein, 1995; Reed & Delhorne, 1995; Plant, Franklin, Franklin, & Steele, 1995; Weisenberger, 1995; Weisenberger, Broadstone, & Saunders, 1989), and visual (Ebrahimi & Kunov, 1991) speechreading supplements, active research on automatic cueing systems has largely been in abeyance since the Cornett et al. (1977) study. In 1988 Vilaclara described a speech processing system designed to recognize phonemes and produce 11 cues, called "kinemes," complementary to the speechreading of French. This system assigns consonants to one of six kineme groups and vowels to one of five kineme groups appropriate for French. When evaluated on sentences consisting of 450 consonant segments and 603 vowel segments produced by a single speaker, this system achieved an identification score of 74% for phones, corresponding to an identification score of 79% for kinemes. The significance of these results is unclear because of the relatively small amount of speech processed (roughly 50 sentences) and the lack of data on the occurrence of deletions and insertions. No evaluation of this system as a speechreading supplement has been reported.
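The advantage of scoring at the kineme or cue-group level rather than at the phone level can be illustrated with a small sketch; the phone-to-group table below is a reduced, illustrative subset and should not be read as the authoritative MCS or kineme chart.

    # Illustrative (partial) assignment of consonant phones to cue groups.
    CUE_GROUP = {
        'p': 1, 'd': 1, 'zh': 1,
        'k': 2, 'v': 2, 'dh': 2,
        'h': 3, 's': 3, 'r': 3,
    }

    def cue_group_score(reference, recognized):
        """Fraction of segments whose recognized phone lies in the same cue
        group as the reference phone; within-group confusions still yield
        the correct cue symbol and so count as correct."""
        hits = sum(CUE_GROUP.get(ref) == CUE_GROUP.get(rec)
                   for ref, rec in zip(reference, recognized))
        return hits / len(reference)

    # 'd' recognized as 'p' is still cued correctly; 'k' recognized as 's' is not.
    print(cue_group_score(['d', 'k', 'r'], ['p', 's', 'r']))   # 0.666...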
Research on speech recognizers usually does not focus on deriving cues for aiding speechreading. In order to estimate the performance that might be achieved with current and future ASR technology, Uchanski et al. (1994) measured how well cues could be derived by a contemporary speech recognition system (Zue, Glass, Phillips, & Seneff, 1989) operated as a phonetic recognizer. The system was trained using both a multi-speaker corpus derived from the TIMIT database (Zue & Seneff, 1988) and a single-speaker corpus of Harvard sentences (IEEE, 1969). The test materials were segmented prior to recognition, making the recognizer's task solely one of classification, so that the only possible errors were substitutions of one phone for another. The results on the single-speaker corpus indicated that the recognizer distinguished consonant segments from vowel segments 94% of the time. Consonants were assigned correctly to one of six cue groups 77% of the time. The eleven monophthongs were labelled correctly 65% of the time. Although no attempt was made to assign vowels to cue groups, it is clear that they would have been assigned correctly at least 65% of the time. Although the results obtained by both Vilaclara (1988) and Uchanski et al. (1994) seem superior to those reported for the Gallaudet-R.T.I. recognizer, it is not clear whether this difference is significant. Indeed, it remains an open question how accurate an automatic speech recognition system must be to benefit speechreaders. Not all the errors made by a phonetic speech recognizer would be problematic for the cue receiver; for example, misidentification of one phoneme as another phoneme in the same cue group would have no effect, since it would produce the correct cue symbol. It may also be possible for cue receivers to discern systematic patterns of cue errors with practice and learn to deal with them. To deal with such uncertainties, Uchanski et al. (1994) used the Post-Labelling Model of audiovisual integration (Braida, 1991) to estimate the effectiveness of recognizer-produced cues in speechreading. This model has been shown to predict audiovisual confusions of consonant and vowel segments from knowledge of the confusions made separately in audio-only and visual-only presentation conditions. In this application Uchanski et al. (1994) attempted to predict how well consonant and vowel segments would be identified when speechread segments were accompanied by ASR-produced cues. Confusion patterns for consonants presented visually were derived from the data of Erber (1972), and for vowels from the data of Hack and Erber (1982), Wozniak and Jackson (1979), and Montgomery and Jackson (1983). Confusion matrices characteristic of speech recognizer performance were derived from the data of Vilaclara and from tests of the MIT recognition system (Zue et al., 1989) operating as a phonetic recognizer. Predicted consonant identification scores for speechreading supplemented by the output of the automatic recognizers generally exceeded 75% correct, roughly 25 percentage points higher than with speechreading alone. Similar trends were obtained for vowels, although overall scores were lower and the improvements associated with the use of cues were not as large. In order to determine whether automatically generated cues can support useful levels of reception for sentences and continuous discourse, Uchanski et al. (1994) used a mathematical model developed by Boothroyd and Nittrouer (1988) to relate phoneme scores to scores for words and whole sentences. They estimated that a level of ASR accuracy in the range of 70-80% correct for phonemes would account for the performance of trained cue receivers on a variety of sentence tests. This range is close to that achieved by the better speech recognition systems currently in existence.
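The Boothroyd and Nittrouer (1988) relations take a simple parametric form, sketched below; the exponent values shown are illustrative placeholders rather than the ones fitted by Uchanski et al. (1994).

    # j-factor and k-factor relations of Boothroyd & Nittrouer (1988).
    # The exponents j and k below are illustrative, not fitted values.
    def word_from_phoneme(p_phoneme, j=2.5):
        """Probability of a whole word being correct, given the phoneme score;
        j is the effective number of independent phonemes per word."""
        return p_phoneme ** j

    def with_context(p_no_context, k=2.0):
        """Benefit of context: k > 1 raises the effective recognition probability."""
        return 1.0 - (1.0 - p_no_context) ** k

    p_phoneme = 0.75                                # e.g., 75% correct phones from ASR
    p_word = word_from_phoneme(p_phoneme)           # ~0.49
    p_word_in_sentence = with_context(p_word)       # ~0.74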
Effects of Cue Imperfections

The analyses reported by Uchanski et al. (1994) are highly encouraging with respect to the potential use of current automatic speech recognition technology for producing Cued Speech. However, they are not adequate to guide the development of an automatic cueing system. For example, Uchanski et al. (1994) considered only the effects of recognition errors on performance. Moreover, they assumed that successive recognizer outputs are statistically independent events, which they need not be, particularly when recognition techniques make use of constraints imposed by phonetic context. In MCS, cues are generally produced with extremely high accuracy. The effect of errors in cue production on the reception of continuous speech (as opposed to syllables or words) has not been studied. The importance of understanding the effect of errors on the reception of continuous speech is underscored by the findings of the Vidvox sensory aid project (Russell, 1986). In this aid continuous speech is represented by a stream of phonetic symbols without word boundary markers. Krasner et al. (1985) demonstrated that a Hidden Markov Model (HMM) recognizer considered for the Vidvox system could achieve a phonetic recognition accuracy of 65% (19% substitutions, 3% deletions, and 13% insertions) when operating in a speaker-dependent continuous-speech mode. Huggins et al. (1986) evaluated the effect of recognition errors on reception of phone representations of the Harvard (IEEE, 1969) sentences. One hearing-impaired subject was able to read perfectly transcribed speech materials at rates in excess of 75 wpm, and five normal-hearing subjects were able to read this material at 25-35 wpm. In the same study, recognition errors were simulated and presented to a normal-hearing listener at rates equal to 0, 1/3, 2/3, and 1 times a base rate (14% substitutions, 5% deletions, and 6% insertions). The subject's reading rate declined from 69 wpm (0 errors) to 26 wpm (1/3 base error rate), 21 wpm (2/3), and 14 wpm (base error rate). Moreover, word reception errors, which were nearly absent in the error-free condition, increased to roughly 30% at the full error rate. These results suggest that even a relatively small number of errors can be extremely deleterious to continuous speech reception, particularly for users who are not extensively trained. For the Vidvox aid to be effective, the automatic recognition system must produce far fewer than 5% substitutions, 2% deletions, and 2% insertions. This would require major improvements in ASR performance compared with currently available technology. Automatically produced cues differ from those of MCS in ways other than the incidence of errors. In MCS, cues are typically produced in synchrony with the facial actions that accompany speaking because the talker prepares the hand and mouth simultaneously. By contrast, the cues that would be produced by ASR systems typically lag behind facial actions because time is required to process the acoustic signal and to make recognition decisions. These times can vary from segment to segment. Moreover, ASR accuracy is expected to improve if greater time lags can be tolerated, because more sophisticated recognition algorithms can be used, such as context-dependent algorithms that base the recognition of a given segment on tentative decisions about the identity of surrounding segments. Finally, whereas the MCS code is based on a phonemic representation of what the speaker intends to say, automatic recognizers typically produce a phonetic representation of what is actually said. The phones that speakers produce often correspond to the intended phonemes only imperfectly and/or inconsistently. In principle, automatic recognizers can overcome the effects of poor or incorrect articulation by making use of higher-order linguistic constraints (on word formation, grammatical structure, etc.). But this strategy requires delaying recognition until subsequent words or phrases are processed, so that cues are not produced until well after the occurrence of facial actions.
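To illustrate the timing problem with hypothetical numbers: if a recognizer cannot label a phone until some time after the phone ends, and the cue is to appear at the phone's acoustic onset, the visual display must be delayed by at least the phone's duration plus the recognizer's processing lag. A minimal sketch, with all figures assumed for illustration:

    # Hypothetical timing figures, in milliseconds.
    phone_duration = 120      # length of the acoustic segment
    recognition_lag = 150     # processing time after the segment ends

    # Earliest time, relative to the phone's acoustic onset, at which the
    # recognizer can output a cue for that phone.
    earliest_cue_time = phone_duration + recognition_lag    # 270 ms

    # Delaying the video by this amount would let the cue appear in
    # synchrony with the (delayed) facial actions.
    required_video_delay = earliest_cue_time                 # 270 ms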
Study of Simulated Cueing Systems

In addition to the limitations on the performance of the speech recognition system, the cues presented by automatic cueing systems are likely to be displayed differently from those of MCS, even when the same cue groups are used.

Methods

A set of experiments was designed to obtain greater insight into the effects that limitations on ASR performance would impose on automatic cueing systems. We generated the cues of MCS by simulating several existing or hypothesized phonetic ASR systems, displayed the cues as handshapes superimposed on a video display of the talker's face, and compared word reception with speechreading alone, with MCS, and with the automatically generated cues. To minimize the need for training, the cue display was designed to resemble the appearance of manual cueing, the cue code was the same as that used in MCS, and all subjects were highly experienced receivers of MCS. Because the study was intended to be exploratory in nature, it was conducted in two phases. Although the phases differed with respect to experimental conditions and subjects, both phases included three baseline conditions: SA (speechreading alone), MCS (Manual Cued Speech), and PSC (perfect synthetic cues); and five (Phase I) or six (Phase II) experimental conditions that tested the effects of cue imperfections on sentence reception. The experimental conditions are described in the next section.

Table 1. Characteristics of the subjects tested.

Subject  Age (years)  Deafness Etiology  Deafness Onset (months)  Past MCS Use (years)  Current MCS Use (hours/day)  Phases of Participation
S1       22           Unknown            3                        12                    1-2                          I & II
S2       21           Unknown            Birth                    19                    5-6                          I & II
S3       27           Rubella            Birth                    23                    1-2                          I & II
S4       24           Unknown            Birth                    19                    2-10                         I
S5       23           Unknown            18                       17                    <1                           II
S6       19           Unknown            Birth                    16                    5-6                          II

Subjects

Four subjects between the ages of 21 and 27 were employed in Phase I and five subjects between the ages of 19 and 27 in Phase II (see Table 1). All subjects were native English speakers who were profoundly, prelingually deafened. All were experienced cue receivers, having used MCS for 12-23 years. Their use of MCS at the time of the study ranged from 1 to 10 hours per day, usually with a parent or transliterator. Although some subjects used hearing aids, the testing used the video display only.

Speech Materials

Three sets of sentence materials were used in this research: CID (Davis & Silverman, 1970); Clarke (Magner, 1972); and Harvard (IEEE, 1969). The CID sentences, which are representative of "everyday" speech, and the Clarke sentences, which are highly contextual, were used to evaluate speech reception. These sentences contain five key words, four monosyllables and one disyllable. The Harvard sentences provide few contextual cues to word identity, e.g., "Glue the sheet to the dark blue background," so that it is difficult to identify a word correctly simply from knowing the other words in the sentence. Videotaped recordings of these sentences were made using a female speaker (a teacher of the deaf) who was proficient in producing MCS.
The sentences were recorded both uncued and with the production of Manual Cued Speech. Speaking rates were roughly 100 wpm for the cued sentences and 140 wpm for the uncued sentences. Four of the subjects tested (S1, S2, S5, and S6) had previous exposure to this speaker, either at school or in other tests of speech reception. High-quality audio and video recordings of the sentence materials had been prepared for earlier studies of speechreading and MCS. The sentences used for speechreading studies were recorded using a narrow-angle camera lens so that the face filled roughly half the available viewing screen. The sentences used for MCS studies were recorded using a wide-angle lens so that the face filled roughly one quarter of the video screen, to accommodate the hand movements. Except for tests of the reception of Manual Cued Speech, which used the recordings of the cued sentences, the experiments used the recordings of the uncued sentences. High-quality audio recordings of the acoustic waveform were made simultaneously with the video recordings. SMPTE (1995) time code was recorded on all materials for synchronization. The audio recordings were segmented and labelled by phonetically trained listeners who had access to spectrographic and orthographic representations of the sentences. These listeners marked the beginnings and ends of phones in the acoustic waveform and assigned labels to the phones based on the audio signal rather than the orthographic representations. The Sensimetrics Speechstation (1994) used for segmentation typically provided 5 ms resolution. Labelling employed the code used to designate phones in the TIMIT database (Zue & Seneff, 1988). These sequences of phonetic labels were used, after possible modification to simulate phonetic speech recognizers, to determine the cues that would be produced by various automatic cueing systems.

Synthetic Cues

The structure of MCS assumes that consonants are typically followed by vowels. It represents each CV combination as a handshape at a specified position. Special rules are used for atypical combinations such as VC, C, CC, and CVC. We developed a regular expression grammar corresponding to these rules and implemented this grammar in software as a finite state machine. Given a sequence of phonetic labels, this machine specified a sequence of handshape-position pairs according to the rules of MCS. Images of the eight handshapes used in MCS were captured and edited from recordings of the speaker producing MCS. These handshapes were then dubbed on a frame-by-frame basis onto the uncued narrow-angle sentence recordings at appropriate positions around the face of the talker. Given the relatively large size of the face in these recordings, it was necessary to reduce the size of the synthetic handshapes so that all positions would fit on the screen. There was no fluid articulation or movement of the hand as it changed shape or position; only eight discrete shapes and five discrete positions were used. Appearance of the cue typically began at the frame corresponding to the start of the consonant segment in the acoustic waveform and was maintained until the end of the following vowel segment (for CV combinations) or of the consonant segment (for consonants occurring alone). CV combinations that included a diphthong were displayed as a pair of cues, with the duration divided equally between the two vowels of the diphthong.
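A minimal sketch of the kind of finite-state cue generator described above is given below. Only the basic CV rule is implemented (a consonant takes its handshape at the position of the following vowel; a lone consonant or lone vowel is paired with a default position or handshape), and the lookup tables are illustrative placeholders rather than the actual MCS code.

    # Illustrative lookup tables; not the actual MCS handshape/position code.
    HANDSHAPE = {'d': 1, 'p': 1, 'k': 2, 'h': 3, 's': 3, 'm': 5, 't': 5}
    POSITION = {'iy': 'mouth', 'ae': 'chin', 'uw': 'side', 'ah': 'side'}
    NEUTRAL_HANDSHAPE = 5        # placeholder for a vowel with no preceding consonant
    NEUTRAL_POSITION = 'side'    # placeholder for a consonant with no following vowel

    def cue_sequence(phones):
        """Map a phone-label sequence to (handshape, position) pairs using the
        basic CV rule; the special MCS rules for CC, CVC, etc. are omitted."""
        cues, i = [], 0
        while i < len(phones):
            if phones[i] in HANDSHAPE:                             # consonant
                shape = HANDSHAPE[phones[i]]
                if i + 1 < len(phones) and phones[i + 1] in POSITION:
                    cues.append((shape, POSITION[phones[i + 1]]))  # CV pair
                    i += 2
                else:
                    cues.append((shape, NEUTRAL_POSITION))         # lone consonant
                    i += 1
            else:                                                  # lone vowel
                cues.append((NEUTRAL_HANDSHAPE,
                             POSITION.get(phones[i], NEUTRAL_POSITION)))
                i += 1
        return cues

    # "deep" -> d + iy as one cue, then p alone.
    print(cue_sequence(['d', 'iy', 'p']))   # [(1, 'mouth'), (1, 'side')]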
Recognizers

Two phonetic automatic speech recognizers were used to derive the cues used in the speech reception tests. Both recognizers were variants of the Hidden Markov Model (HMM) recognizers widely under study today. The recognizers differed with respect to the types of models used to represent the phonetic units being recognized. The Context Independent Recognizer used one model for each phone, so that, for example, the same model would have been used for the vowel sounds in "bib" and "kick". The Context Dependent Recognizer used different models for the same phone, depending on the identity of the subsequent phone. In general, use of the Context Dependent Recognizer produces higher recognition accuracy (79% correct identification of phones) than the Context Independent Recognizer (59% correct), at the cost of greater computational load. However, the lower accuracy of the Context Independent Recognizer was not important for this study, since only the pattern of recognizer errors, rather than the frequency of errors, was used in simulations of potential recognizers. By contrast, the Context Dependent Recognizer was used as an example of the performance of current state-of-the-art recognizers. Detailed descriptions of the recognizers, together with a description of how the recognizer output was used to produce cues, are provided in the Appendix.

Training and Testing

During both training and testing, sentences were presented on a video monitor at a distance of roughly one meter from the subject. No audio signal was presented. Responses were recorded in writing on prepared forms. The CID and Clarke sentences were processed and used to familiarize the subjects with each presentation condition. During training, subjects were shown each sentence orthographically after their response was recorded. The same training sentences were presented to each subject. For testing, the Harvard sentences were used exclusively and no feedback was provided. These sentences are scored by keywords: five per sentence. Although the Harvard sentence lists have been equated for difficulty under acoustic presentation conditions, they are not necessarily of equal difficulty when presented with MCS. To equalize the average difficulty of the sentences used in the different test conditions, we used different sentence lists to represent a given test condition for different subjects. The large size of the Harvard corpus (720 sentences) generally allowed us to do this. However, subjects S2 and S6 were tested on the same lists in Phase II. Also, all subjects were tested on the same sentences in both the Speechreading Alone and Manual Cued Speech conditions. A minimum of 40 Harvard sentences (containing a total of 200 keywords) was presented to each subject in every condition. Some baseline conditions in Phase I utilized 60 sentences (300 keywords). We estimated that scores computed from 40 sentences would have a standard deviation of 3.5 percentage points or less, so that differences between scores for a given subject of ten percentage points or more would generally be statistically significant.
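The 3.5-point figure follows from treating each of the 200 keywords as an independent binomial trial; the arithmetic, with the worst case of a true score near 50%, is sketched below.

    import math

    n_keywords = 40 * 5             # 40 sentences x 5 keywords per sentence
    p = 0.5                         # worst-case true keyword probability
    sd_points = 100 * math.sqrt(p * (1 - p) / n_keywords)
    print(round(sd_points, 1))      # 3.5

    # The standard deviation of the difference between two such scores is about
    # sqrt(2) * 3.5 = 5 points, so a 10-point difference is roughly two standard
    # deviations of the difference.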
Conditions and Results

The SA (speechreading alone) and MCS (Manual Cued Speech) conditions were intended to provide reference scores for all other conditions tested. Comparisons with the SA condition estimate the benefit provided by each type of cue presentation. Comparisons with the MCS condition estimate the effects of the artificial display and the degradations introduced in the cues. Comparisons with the PSC (perfect synthetic cues) condition estimate the effects of the degradations introduced in the experimental conditions. The SA, MCS, and PSC conditions were tested at the beginning and end of each two-day session. The experimental conditions were tested in two blocks, each of which contained all of the experimental conditions. To control for the effects of learning, the order of testing the experimental conditions was randomized, with different orders used in each block. The order of testing conditions was the same for all subjects.

Test Procedures

In each phase, training was provided prior to testing on all conditions except for the second set of tests of the SA, MCS, and PSC conditions. In the first set of baseline tests, 10 training sentences preceded 20 testing sentences for the SA and MCS conditions. For the less familiar PSC condition, 30 training sentences were presented prior to the 20 test sentences. For the experimental conditions, 10 training and 20 testing sentences were used in each block. No sentence was presented to any subject more than once. Except for S4, who was tested at home, subjects were tested individually in a sound-treated booth at M.I.T.

Scoring and Analysis of Results

Reception accuracy for the five keywords in each Harvard sentence was evaluated using strict scoring rules (plural nouns were not accepted for singular, past tense was not accepted for present, etc.). However, homophone substitutions were not counted as errors. An analysis of variance (ANOVA; e.g., Winer, 1971) was applied to the arcsine-transformed scores to identify the factors that played a statistically significant role in the observed pattern of scores. Results from each phase were analyzed separately. The two factors analyzed, Conditions and Subjects, were treated as fixed effects, although conclusions would be essentially unchanged if the Subjects factor were treated as a random effect. It was assumed that there were no differences among sentence lists. An F test at the 0.01 significance level indicated that there were significant differences in scores between Conditions and between Subjects, but no significant Conditions × Subjects interaction. Thus the observed differences between scores in the different conditions were, on average, independent of the subject tested. These results held true in both phases. For both phases, paired comparisons using two-tailed t-tests at the 0.01 significance level were performed between scores in the PSC condition and in the experimental conditions, and between the MCS and PSC conditions, using Fisher's method. In Phase II, additional t-tests were used to compare scores in the REAL condition with those in other conditions. The results of these comparisons are discussed below.
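The exact form of the arcsine transform applied to the scores is not specified here; the conventional variance-stabilizing version, sketched below, is 2*arcsin(sqrt(p)), which gives transformed scores a variance of roughly 1/n regardless of the underlying proportion.

    import math

    def arcsine_transform(p):
        """Variance-stabilizing transform for a proportion p (0 <= p <= 1);
        the transformed score has a variance of roughly 1/n for n trials,
        independent of p, which suits an ANOVA on percent-correct scores."""
        return 2.0 * math.asin(math.sqrt(p))

    # Example: a keyword score of 72.8%.
    print(arcsine_transform(0.728))   # ~2.04 radians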
Phase I

The experimental conditions in Phase I were concerned with the effects of recognizer errors and delays in the presentation of the cues. The conditions tested and the average scores obtained are summarized in Table 2.

Table 2. Conditions and keyword scores (%) for Phase I.

Condition  Error Rate (%)  Delay (ms)  S1  S2  S3  S4  Avg
SA         -               -           31  33  20  15  24.8
MCS        -               -           87  91  82  54  78.5
PSC        -               -           79  75  74  63  72.8
E10        10              0           70  72  55  38  58.8
E20        20              0           63  57  43  32  48.8
E20D1      20              33          54  55  47  40  49.0
E20D3      20              100         46  39  36  28  37.2
E10D5      10              165         40  44  31  24  34.8

Baseline Conditions. The baseline conditions used unaltered recordings of the speaker. In Phase I, six lists, containing a total of 300 keywords, were used in the SA and MCS conditions, with the same lists presented to all subjects. Scores for SA (speechreading alone) fell well below 40% for all subjects and were lower than the scores obtained by each subject for all other test conditions. The range of scores is consistent with the range (16-34%) reported by Uchanski et al. (1994) for similar materials. Scores for MCS (Manual Cued Speech) were, except for S4, higher than the scores obtained in all other conditions, and were similar to those reported by Uchanski et al. for similar materials (68-93%). The PSC (perfect synthetic cues) condition was intended to evaluate a cueing system that employs an ideal speech recognizer operating on the acoustic speech waveform. For each sentence, the cues were derived from the unaltered phone sequence marked by the trained human labellers, as described on page 10. Although one might expect this sequence to be phonetically perfect, the speaker did not produce all phones exactly as specified by the sentence orthography. Some speech sounds were omitted and others were slurred. Since the transcriptions were derived from acoustic waveforms, rather than the sentence orthography, the cue sequence contained a small number of omissions and some errors unavoidably introduced by the human labellers. Each subject was tested on four lists of sentences in the PSC condition, with different lists presented to each subject. Scores in the PSC condition were fairly high, typically only slightly lower than those for MCS, except for S4, who performed better in the PSC than in the MCS condition. The difference between MCS and PSC scores was not statistically significant. If the unusual results obtained by S4 are omitted, scores in the PSC condition were slightly inferior to those in the MCS condition. In the remainder of this report, scores in the PSC condition are used as a reference to evaluate the effects of cue imperfections in other conditions.

Errors. The E10 and E20 conditions tested reception when random errors were introduced into 10% and 20% of the phones, respectively, as described on page 31. Scores generally decreased as the proportion of errors increased, but the difference between scores in the PSC and E10 conditions was not statistically significant. This result is encouraging because it suggests that an automatic cueing system with 10% errors could benefit cue receivers almost as much as a perfect system. However, increasing the errors by 10 percentage points (i.e., to 20% errors) caused a significant decrease in keyword reception compared with PSC and E10.

Errors and Fixed Delays. Three conditions examined the effect of combining recognizer errors with fixed presentation delays. In conditions E20D1, E20D3, and E10D5, the phone sequence contained 20% errors and a 33 ms delay, 20% errors and a 100 ms delay, and 10% errors and a 165 ms delay, respectively. These delays range up to roughly half the average cue duration, since there were roughly 3.5 cues per second in these materials. Errors were introduced as in the E10 and E20 conditions, and delays are specified relative to the beginning of the marked acoustic segment. Generally, as presentation delay increased, test scores decreased. However, the difference in scores between E20D1 and E20, both containing 20% errors, is small and not statistically significant, suggesting that small (33 ms) fixed delays have only minor effects. Some subjects found the 100 ms delay noticeable and bothersome, and all found the 165 ms delay to be so. The decrease in scores between conditions E20D1 and E20D3 was statistically significant, but the difference in scores between E20D3 and E10D5 was not.
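A sketch of how recognizer errors and fixed delays of the kind used in these conditions might be simulated is given below; the actual error-introduction procedure (described on page 31 of the original report) is not reproduced here, so the even split between substitutions and deletions and all other details are assumptions made for illustration.

    import random

    def corrupt_transcription(phones, error_rate, delay_ms, inventory):
        """Introduce substitution and deletion errors into a labelled phone
        sequence and shift every time mark by a fixed delay; each phone is a
        (label, start_ms, end_ms) tuple.  Illustrative only."""
        out = []
        for label, start, end in phones:
            if random.random() < error_rate:
                if random.random() < 0.5:          # assumed 50/50 split
                    continue                       # deletion: drop the phone
                label = random.choice([p for p in inventory if p != label])
            out.append((label, start + delay_ms, end + delay_ms))
        return out

    inventory = ['d', 'iy', 'p', 'k', 'ae', 's']
    phones = [('d', 0, 80), ('iy', 80, 200), ('p', 200, 290)]
    random.seed(0)
    print(corrupt_transcription(phones, error_rate=0.2, delay_ms=100, inventory=inventory))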
Phase II

Phase II testing was conducted roughly two months after Phase I and included six new experimental conditions, as well as the three baseline conditions (see Table 3).

Table 3. Conditions and keyword scores (%) for Phase II.

Condition  Error Rate (%)  Delay (ms)     S1  S2  S3  S5  S6  Avg
SA         -               -              43  40  21  37  38  35.8
MCS        -               -              92  93  86  81  91  88.6
PSC        -               -              85  85  83  76  78  81.4
E0DR       0               -33, 0, +33    74  87  65  67  88  76.2
E10DR      10              -33, 0, +33    71  81  67  60  70  69.9
E20DR      20              -33, 0, +33    65  71  53  44  69  60.4
E20MK      20              0              61  61  48  46  61  55.4
E20IN      20              0              60  61  60  49  58  57.7
REAL       21              -              79  73  70  65  64  70.2

Baseline Conditions. In Phase II, four lists, containing 400 keywords, were used to test the SA, MCS, and PSC conditions. A total of 13 lists were used to test the SA and MCS conditions, with each list presented to roughly half of the subjects in each condition. As in Phase I, different lists were presented to each subject in the PSC condition. Scores for these three conditions averaged roughly 10 points higher than in Phase I, and were somewhat higher than those reported by Uchanski et al. (1994). The three subjects who participated in both phases all obtained higher scores on these conditions than when tested on comparable conditions in Phase I. Averaged across subjects, scores in the PSC condition were the second highest of all conditions, and these scores were not statistically different from those of the MCS condition. The three subjects who participated in both Phases I and II improved more on the PSC condition than on the MCS or SA conditions, suggesting that they were not fully practiced with this novel display in the Phase I tests.

Errors and Random Delays. These conditions evaluated the effects of combining errors (introduced as described on page 31) with the random presentation delays that are likely to occur in a real-time recognition system. Conditions E0DR, E10DR, and E20DR contained 0%, 10%, and 20% errors, respectively, and cues were presented with delays of -1, 0, or +1 frame (with equal probability) relative to the frame corresponding to the start of the acoustically labelled phone. The difference between scores obtained in the MCS and E0DR conditions was not significant, nor was the difference between PSC and E0DR. This result is consistent with that for the E20D1 condition in Phase I, and suggests that one frame (33 ms) of fixed or random delay has little effect. The differences between PSC and the other random delay conditions (E10DR and E20DR) were significant, as was that between E0DR and E20DR, but the difference between scores in the E10DR and E20DR conditions was not. For the most part, as with the fixed delay conditions, speech reception scores decreased as the error rate increased.

Marked Errors. Some speech recognizers are capable of both identifying the spoken phone and estimating whether the recognition decision is likely to be correct. The E20MK condition explored whether such indications could assist the cue receiver in dealing with recognition errors. Phonetic transcripts were prepared in which 20% of the phones were substituted or deleted. For half of the substitution errors, the associated cue handshape was outlined with a red square to indicate to the subject that it might be wrong.
Since the recognizer would be expected to report erroneously that some correctly identified phones were incorrect, the handshapes corresponding to an equal number of correctly recognized phones were also outlined. Most of the subjects obtained scores in the E20MK condition that were somewhat lower than those in the E20DR condition, which had the same error rate but also included jitter in the time of presentation. This suggests that the subjects may have treated all marked cues as erroneous. It seems likely that subjects did not receive adequate training to develop successful strategies for making optimum use of the error indications.

Insertions. The error conditions described thus far simulated the tendency of automatic recognizers to fail to detect the presence of speech elements (deletion errors) and to confuse speech elements with one another (substitution errors), but not the tendency to insert elements where they do not occur in the phonetic sequence (insertion errors). The sum of the deletion and insertion rates is often roughly constant for a particular type of recognizer (Schwartz, Chow, Kimball, Roucos, Krasner, & Makhoul, 1985). To examine the effect of such erroneous insertions, in condition E20IN the deletion rate for each input phone was halved and insertions were introduced at the same rate as deletions. The identity of the inserted phone was selected from among the remaining phones with equal probability. The duration of the inserted phone was arbitrarily set equal to 70% of the average duration of that phone in the corpus. To accommodate the insertion, the preceding and following phones were shortened. Additional details are provided in Bratakos (1995). Scores in the E20IN condition were roughly equal to those in the E20MK condition, i.e., somewhat lower than those in the E20DR condition. Two factors seem likely to account for the potency of insertion errors. First, the erroneous insertion of a phone altered the duration and time of occurrence of adjacent phones, and delayed the presentation of the following phone. Second, whereas substitutions and deletions are likely to occur naturally in the presentation and reception of MCS, insertions represent a new phenomenon for cue receivers. Also, as with the E20MK condition, subjects may not have had adequate training to develop strategies for dealing with the insertion of erroneous cues.

Context-Dependent Recognizer. In this condition (REAL), cues were produced by a state-of-the-art phonetic recognizer trained on the talker who produced the materials used in the speech reception tests. This recognizer (described in the Appendix) operated off-line rather than in real time, and produced a phone sequence (including time markings) that included substitution, deletion, and insertion errors. This phone sequence was used to produce a cue sequence according to the rules described on page 11, using the times of occurrence of phones estimated by the recognizer rather than the times included in the phonetic transcriptions of the acoustic waveforms. No attempt was made to delay the presentation of the cues in accordance with the time required by the recognition process. Scores for the REAL condition were fairly similar to those observed in the E10DR condition, which simulated a 10% error rate and included 33 ms of cue presentation jitter.